flowchart TD
A[Load Data: matches.csv<br>team_appearances.csv<br>player_appearances.csv<br>goals.csv] --> B[Matches Data]
A --> C[Team Appearances Data]
A --> D[Player Appearances Data]
A --> E[Goals Data]
B & C & D & E -->|Combine Data| F[Combined Data]
F -->|Exploratory Data Analysis| G[EDA]
G -->|Check for Missing Values| H[Missing Values]
G -->|Visualize Score Distribution| I[Score Distribution]
G -->|Analyze Correlation| J[Correlation Analysis]
G -->|Explore Result Patterns| K[Result Patterns]
G -->|Investigate Goals Distribution| L[Goals Distribution]
G -->|Assess Predictor Variables| M[Predictor Variables Analysis]
F -->|Qualitative Analysis| N[Qualitative Analysis]
N -->|Text Mining| O[Text Mining]
O -->|Identify Player Nationalities| P[Player Nationalities]
O -->|Explore Player Positions| Q[Player Positions]
O -->|Goal Distribution by Minute| R[Goals by Minute]
O -->|Generate Match Data Word Cloud| S[Match Data Word Cloud]
O -->|Generate Player Nationalities Word Cloud| T[Player Nationalities Word Cloud]
F -->|Quantitative Analysis| U[Quantitative Analysis]
U -->|Modeling| V[Modeling]
V -->|Prepare Data| W[Data Preparation]
V -->|Model Training and Testing | X[Train & Test]
V -->|Fit Multinomial Logistic Regression| Y[Model Fitting]
V -->|Predict and Evaluate Model| Z[Prediction & Model Evaluation]
V -->|Generate Summary| AA[Model Summary]
classDef blue fill:#9f9,stroke:#333,stroke-width:2px;
class B,C,D,E green;
%% Set the size of the flowchart
%%{init: {"themeVariables": {"flowchart": {"maxWidth": "100%", "maxHeight": "600px"}}}}
MS5130 Assignment 3
FIFA World Cup Data Analysis
Image Source: Pinterest
FIFA World Cup Data Analysis Flowchart:
Install and load the necessary packages and libraries:
options(repos = "https://cloud.r-project.org") # Set CRAN mirror
install.packages('readr')
install.packages('tidyverse')
install.packages("ISLR")
install.packages("gam")
install.packages("interactions")
install.packages("htmlwidgets")
library(readr)
library(tidyr)
library(dplyr)
library(leaflet)
library(plotly)
library(ggplot2)
library(forcats)
library(gam)
library(ISLR)
library(interactions)
library(htmlwidgets)Check and set the present working directory to the location where the datasets are present:
# Check the present working directory
getwd()[1] "C:/Users/Manoj/Desktop/Semester 2 Assignments/Applied Analytics in Business and Society/Assignment 3/FIFA"
# Set the working directory to the location where the datasets are present
setwd("C:/Users/Manoj/Desktop/Semester 2 Assignments/Applied Analytics in Business and Society/Assignment 3/FIFA")Step1: Data Preparation:
Import the datasets matches, team_appearances, player_appearances and goals using readr package:
#Import the datasets
matches <- read.csv("matches.csv")
team_appearances <- read.csv("team_appearances.csv")
player_appearances <- read.csv("player_appearances.csv")
goals <- read.csv("goals.csv")Combine the datasets matches, team_appearances, player_appearances and goals using common column match_id present in all four datasets:
# Combine datasets using common columns
combined_data <- matches %>%
left_join(team_appearances, by = "match_id") %>%
left_join(player_appearances, by = "match_id") %>%
left_join(goals, by = "match_id")
View(combined_data)
head(combined_data)
str(combined_data)
summary(combined_data)Step 2: Exploratory Data Analysis:
# Exploratory Data Analysis (EDA)
install.packages("corrplot")
library(corrplot)
install.packages("corrplot")
library(corrplot)# Check for missing values
missing_values <- colSums(is.na(combined_data))
print(missing_values) key_id.x tournament_id.x tournament_name.x
0 0 0
match_id match_name.x stage_name.x
0 0 0
group_name.x group_stage.x knockout_stage.x
0 0 0
replayed.x replay.x match_date.x
0 0 0
match_time.x stadium_id.x stadium_name.x
0 0 0
city_name.x country_name.x home_team_id
0 0 0
home_team_name home_team_code away_team_id
0 0 0
away_team_name away_team_code score
0 0 0
home_team_score away_team_score home_team_score_margin
0 0 0
away_team_score_margin extra_time.x penalty_shootout.x
0 0 0
score_penalties home_team_score_penalties away_team_score_penalties
0 0 0
result.x home_team_win away_team_win
0 0 0
draw.x key_id.y tournament_id.y
0 0 0
tournament_name.y match_name.y stage_name.y
0 0 0
group_name.y group_stage.y knockout_stage.y
0 0 0
replayed.y replay.y match_date.y
0 0 0
match_time.y stadium_id.y stadium_name.y
0 0 0
city_name.y country_name.y team_id.x
0 0 0
team_name.x team_code.x opponent_id
0 0 0
opponent_name opponent_code home_team.x
0 0 0
away_team.x goals_for goals_against
0 0 0
goal_differential extra_time.y penalty_shootout.y
0 0 0
penalties_for penalties_against result.y
0 0 0
win lose draw.y
0 0 0
key_id.x.x tournament_id.x.x tournament_name.x.x
1530 1530 1530
match_name.x.x match_date.x.x stage_name.x.x
1530 1530 1530
group_name.x.x team_id.y team_name.y
1530 1530 1530
team_code.y home_team.y away_team.y
1530 1530 1530
player_id.x family_name.x given_name.x
1530 1530 1530
shirt_number.x position_name position_code
1530 1530 1530
starter substitute captain
1530 1530 1530
key_id.y.y goal_id tournament_id.y.y
3266 3266 3266
tournament_name.y.y match_name.y.y match_date.y.y
3266 3266 3266
stage_name.y.y group_name.y.y team_id
3266 3266 3266
team_name team_code home_team
3266 3266 3266
away_team player_id.y family_name.y
3266 3266 3266
given_name.y shirt_number.y player_team_id
3266 3266 3266
player_team_name player_team_code minute_label
3266 3266 3266
minute_regulation minute_stoppage match_period
3266 3266 3266
own_goal penalty
3266 3266
Distribution of home_team_score and away_team_score:
# Distribution of home_team_score and away_team_score
plot <- ggplot(combined_data, aes(x = home_team_score, y = away_team_score)) +
geom_point() +
labs(x = "Home Team Score", y = "Away Team Score", title = "Distribution of Home and Away Team Scores") +
theme_minimal()
plotly::ggplotly(plot)The above scatter plot of “Distribution of Home and Away Team Scores” presents the following observations:
- The data points are widely dispersed across the plot, indicating a variety of score combinations for home and away teams.
- The range for the home team score is 0 to 10, while the away team score ranges from 0 to 6.
- The plot suggests that both high and low scores are possible for either team, regardless of whether they are playing at home or away.
Correlation matrix correlation matrix of match statistics:
# Correlation matrix correlation matrix of match statistics including home_team_score, away_team_score, goals_for, and goals_against.
correlation_matrix <- cor(combined_data[, c("home_team_score", "away_team_score", "goals_for", "goals_against")])
corrplot(correlation_matrix, method = "color")The above correlation matrix for match statistics provides several key insights:
Home Team Score and Goals For: There is a strong positive correlation, indicating that as the home team scores increase, the goals for also tend to increase. This suggests that the home team’s offensive performance is a significant contributor to their scoring.
Home Team Score and Goals Against: A negative correlation is observed, meaning that as the home team scores more, they tend to concede fewer goals. This could imply effective defensive play or control of the game when the home team is leading.
Away Team Score and Goals Against: There is a positive correlation, suggesting that when the away team scores, the goals against them are also higher. This might reflect a more open game where both teams are scoring.
Away Team Score and Goals For: A negative correlation is observed, which could indicate that when the away team is scoring, the home team’s goals for are lower, possibly due to the away team’s defensive strategy.
Goals For and Goals Against: These are negatively correlated, which is expected as teams that score more often concede less and vice versa.
Boxplot of home_team_score and away_team_score by result:
# Boxplot of home_team_score and away_team_score by result
plot <- ggplot(combined_data, aes(x = result.x, y = home_team_score)) +
geom_boxplot() +
labs(x = "Result", y = "Home Team Score", title = "Boxplot of Home Team Score by Result") +
theme_minimal()
plotly::ggplotly(plot)plot <- ggplot(combined_data, aes(x = result.x, y = away_team_score)) +
geom_boxplot() +
labs(x = "Result", y = "Away Team Score", title = "Boxplot of Away Team Score by Result") +
theme_minimal()
plotly::ggplotly(plot)Combining both the home and away team score box plots by match results, presents the following observations:
Home Team Wins: Home teams tend to score significantly higher, with a median score around 3, when they win. Away teams, on the other hand, score much less in their losses, with a median score around 1.
Away Team Wins: Away teams have a median score of about 2 when they win, indicating that they don’t need to score as many goals as home teams do to win. Home teams typically score very low, with a median score around 1, when they lose.
Draws: Both home and away teams have a median score of around 2 in draws, suggesting a more balanced outcome in terms of scoring.
Score Range and Variability: Home teams show a wider range of scores and greater variability, especially in games they win, which is indicated by the presence of several outliers. Away teams generally have a narrower score range across all results.
Outliers: There are outliers present in all categories for both home and away teams, indicating occasional matches with unusually high scores.
These observations suggest that home teams have a more pronounced advantage in scoring, particularly in matches they win. Away teams, while they tend to score fewer goals, can still secure wins with fewer goals scored. Draws tend to be more evenly matched in terms of scoring for both teams. The presence of outliers in both plots indicates that while there are general trends, individual matches can have unexpected outcomes.
Histograms to check normality of predictor variables:
# Histograms to check normality of predictor variables
histograms <- combined_data %>%
select(home_team_score, away_team_score, goals_for, goals_against) %>%
gather() %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.8) +
facet_wrap(~key, scales = "free") +
labs(x = "Value", y = "Frequency", title = "Histograms of Predictor Variables") +
theme_minimal()
plotly::ggplotly(histograms)The Histograms show the distribution of four predictor variables: away_team_score, goals_against, goals_for, and home_team_score. Below are some observations that can be made from the histograms:
Skewness:
- Away_team_score: The histogram for away_team_score is skewed to the right. This means that there are more games with lower away team scores and fewer games with very high away team scores.
- Goals_against: The histogram for goals_against is also skewed to the right, but to a lesser extent than away_team_score. This suggests that there are more games with fewer goals scored against a team and fewer games with very high goals scored against a team.
- Goals_for: The histogram for goals_for is also skewed to the right, but less so than away_team_score. This suggests that there are more games with fewer goals scored by a team and fewer games with very high goals scored by a team.
- Home_team_score: The histogram for home_team_score is somewhat symmetrical, with a slight skew to the right. This suggests that the distribution of home team scores is more evenly spread out, but there are still slightly more games with lower home team scores.
The histograms for all four variables appear to have a single mode, which is the most frequent value.
Overall, the histograms suggest that the data is not perfectly normally distributed. All four variables have some degree of positive skew, meaning that there are more frequent lower values than higher values. This is a common finding in sports data, where scores tend to be clustered at the low end but can vary widely on the high end.
Step 3: Quantitative Analysis (Modeling)
For the quantitative analysis, let us use a multinomial logistic regression model using the mgcv and nnet libraries in R. The process involved several steps as outlined below:
- Data Preparation:
- The outcome variable
result.ywas transformed into a factor with predefined levels: “win”, “lose”, and “draw”. - Unique values in
result.ywere verified to ensure proper encoding. - Missing values in
result.ywere checked and accounted for.
- The outcome variable
# Quantitative Analysis
# Modeling
library(mgcv)
library(nnet)
# Convert result.y variable to a factor with specified levels
combined_data$result.y <- factor(combined_data$result.y, levels = c("win", "lose", "draw"))
# Check unique values in result.y
unique(combined_data$result.y)[1] win lose draw
Levels: win lose draw
# Check for missing values in result.y
sum(is.na(combined_data$result.y))[1] 0
- Model Training and Testing:
- The dataset was split into training and test sets with a ratio of 70:30, respectively.
- Random indices were generated to create the training dataset.
- The presence of all outcome categories, including “draw”, was confirmed in the training data.
- The reference category for modeling was set to “draw” using
relevel().
# Set seed for reproducibility
set.seed(123)
# Sample size for training data, 70% training and 30% test data
train_size <- floor(0.7 * nrow(combined_data))
# Generate random indices for training data
train_indices <- sample(seq_len(nrow(combined_data)), size = train_size)
# Create training and test datasets
train_data <- combined_data[train_indices, ]
test_data <- combined_data[-train_indices, ]
# Check unique values in result.y in train_data
unique(train_data$result.y)[1] draw lose win
Levels: win lose draw
# Check if the "draw" category is present in the model
"draw" %in% levels(train_data$result.y)[1] TRUE
# Set the reference category to "draw"
train_data$result.y <- relevel(train_data$result.y, ref = "draw")- Model Fitting:
- The multinomial logistic regression model was fitted using predictors such as home team score, away team score, and goals for.
# Fit the multinomial logistic regression model
model <- multinom(result.y ~ home_team_score + away_team_score + goals_for, data = train_data)# weights: 15 (8 variable)
initial value 77088.525684
iter 10 value 26595.433583
iter 20 value 7838.517378
iter 30 value 7604.920126
final value 7601.906689
converged
- Prediction and Evaluation:
- Predictions were made on the test data using the fitted model.
- The accuracy of the predictions was calculated to assess the model’s performance.
# Predict on test data
predictions <- predict(model, newdata = test_data)
# Check levels of result.y after modeling
levels(train_data$result.y)[1] "draw" "win" "lose"
# Calculate accuracy
accuracy <- mean(predictions == test_data$result.y)# Print accuracy in %
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))[1] "Accuracy: 96.88 %"
# Summary of the model
summary(model)Call:
multinom(formula = result.y ~ home_team_score + away_team_score +
goals_for, data = train_data)
Coefficients:
(Intercept) home_team_score away_team_score goals_for
win -2.126455 -14.98860 -14.19132 29.09556
lose -2.157068 13.75401 12.72983 -26.55352
Std. Errors:
(Intercept) home_team_score away_team_score goals_for
win 0.05589458 8.368049 4.394396 9.451744
lose 0.05644698 3.643451 2.826032 4.611278
Residual Deviance: 15203.81
AIC: 15219.81
Summary of Multinomial Logistic Regression Model:
Coefficients:
The coefficients represent the estimated effects of each predictor variable on the log-odds of the outcome classes (win and lose).
- Intercept:
- For the “win” class, the intercept is -2.126455.
- For the “lose” class, the intercept is -2.157068.
- Home Team Score (home_team_score):
- For the “win” class, the coefficient is -14.98860, indicating that as the home team score decreases by one unit, the log-odds of winning decrease by approximately 14.99.
- For the “lose” class, the coefficient is 13.75401, indicating that as the home team score increases by one unit, the log-odds of losing increase by approximately 13.75.
- Away Team Score (away_team_score):
- For the “win” class, the coefficient is -14.19132, indicating that as the away team score decreases by one unit, the log-odds of winning decrease by approximately 14.19.
- For the “lose” class, the coefficient is 12.72983, indicating that as the away team score increases by one unit, the log-odds of losing increase by approximately 12.73.
- Goals For (goals_for):
- For the “win” class, the coefficient is 29.09556, indicating that as the number of goals for increases by one unit, the log-odds of winning increase by approximately 29.10.
- For the “lose” class, the coefficient is -26.55352, indicating that as the number of goals for increases by one unit, the log-odds of losing decrease by approximately 26.55.
Standard Errors:
The standard errors represent the uncertainty in the estimated coefficients. Lower standard errors indicate more precise estimates.
- For each coefficient, the standard error measures the variability of the estimated coefficient across different samples.
Residual Deviance:
The residual deviance measures how well the model fits the data. Lower values indicate better fit. In our model case, the residual deviance is 15203.81.
AIC (Akaike Information Criterion):
The AIC is a measure of the relative quality of the model. Lower AIC values indicate better models. The AIC for this model is 15219.81.
Model Accuracy:
The accuracy of the model is approximately 96.88%. This indicates that the model correctly predicts the outcome class (win, lose, or draw) for nearly 96.88% of the observations in the test dataset.
Model Performance Analysis:
# Model Analysis
install.packages("pROC")
install.packages("pdp")
library(pROC)
library(pdp)Confusion Matrix:
# Confusion Matrix
conf_matrix <- table(test_data$result.y, predictions)
conf_matrix_plot <- plot_ly(z = ~as.matrix(conf_matrix), colorscale = "Viridis", type = "heatmap",
x = colnames(conf_matrix), y = rownames(conf_matrix),
colorbar = list(title = "Counts"))
conf_matrix_plot <- layout(conf_matrix_plot, title = "Confusion Matrix",
xaxis = list(title = "Predicted Class"),
yaxis = list(title = "Actual Class"))
conf_matrix_plotThe performance of the multinational machine learning model for predicting football match outcome was evaluated using a confusion matrix. The analysis revealed the following key observations:
- High Overall Accuracy:
- The model demonstrated good overall performance, with a high number of correct predictions on the diagonal of the confusion matrix.
- Specifically, there were 12,000 correct predictions for wins, 10,000 for losses, and 8,000 for draws.
- Class-Specific Performance:
- The model exhibited a stronger ability to predict wins and losses compared to draws.
- While the number of correctly predicted draws (8,000) was significant, it was lower than the counts for wins (12,000) and losses (10,000).
- Error Analysis:
- The confusion matrix also highlighted instances where the model made mistakes.
- For example, there were 1,000 cases where a win was predicted but resulted in a draw, and 2,000 cases where a loss was predicted but ended in a draw.
Prediction Distribution Plot:
# Prediction Distribution Plot
# Extract predicted probabilities for each class
probabilities <- predict(model, newdata = test_data, type = "probs")
# Plot histograms of predicted probabilities for each class
par(mfrow = c(1, length(levels(train_data$result.y))))
for (class in levels(train_data$result.y)) {
hist(probabilities[, class], main = paste("Predicted Probabilities for", class), xlab = "Probability")
}The performance of the multinational machine learning model for predicting football match outcome was evaluated using a prediction distribution plot. The analysis revealed the following key observations:
- Concentrated predictions around the correct outcome:
- The bars for each prediction (win, draw, lose) are concentrated around the high probability regions (towards 1.0 on the y-axis).
- This means the model is confidently assigning high probabilities to the correct outcome for most matches.
- Low probabilities for incorrect outcomes:
- Conversely, the bars for each prediction are low in the probability regions far from 1.0.
- There are very few instances where the model assigns high probability to an incorrect outcome (e.g., predicting a win with a high probability when the team actually loses).
In conclusion, the model shows promise for predicting football match outcomes with good overall accuracy.
Step 4: Qualitative Analysis (Text Mining)
Text mining on player names to identify common nationalities:
Let us analyze player names to predict their nationalities based on certain patterns in their family names. First select the unique player identifiers and family names from the dataset, apply rules using regular expressions to predict nationalities, counts players for each nationality, and creates an interactive bar chart visualizing the nationality distribution. The chart helps identify the most common nationalities among players based on their names.
# Qualitative Analysis
# Text mining on player names to identify common nationalities
player_nationalities <- combined_data %>%
select(player_id.x, family_name.x, given_name.x) %>%
distinct() %>%
mutate(nationality = case_when(
grepl("van", family_name.x, ignore.case = TRUE) ~ "Dutch",
grepl("al-", family_name.x, ignore.case = TRUE) ~ "Arabic",
grepl("sson", family_name.x, ignore.case = TRUE) ~ "Scandinavian",
grepl("sch", family_name.x, ignore.case = TRUE) ~ "German",
grepl("ez", family_name.x, ignore.case = TRUE) ~ "Spanish",
grepl("eau", family_name.x, ignore.case = TRUE) ~ "French",
grepl("ini", family_name.x, ignore.case = TRUE) ~ "Italian",
grepl("ic$", family_name.x, ignore.case = TRUE) ~ "Serbian",
grepl("son$", family_name.x, ignore.case = TRUE) ~ "Icelandic" # Identify Icelandic players
))
# Count the number of players for each nationality
nationality_counts <- table(player_nationalities$nationality)
# Create a data frame for plotting
nationality_data <- data.frame(nationality = names(nationality_counts),
count = as.numeric(nationality_counts))
# Reorder the levels of nationality in descending order of count
nationality_data$nationality <- factor(nationality_data$nationality,
levels = rev(nationality_data$nationality))
# Define custom colors for each nationality
custom_colors <- c(
"Dutch" = "#ff7f0e", # Orange
"Arabic" = "#008000", # Green
"Scandinavian" = "#ff69b4", # Pink
"German" = "#000000", # Black
"Spanish" = "#d62728", # Red
"French" = "#1f77b4", # Blue
"Italian" = "#0000ff", # Dark Blue
"Serbian" = "#7f7f7f", # Gray
"Icelandic" = "#87ceeb" # Sky Blue
)
# Reorder the custom colors accordingly
custom_colors <- custom_colors[levels(nationality_data$nationality)]
# Plotting an interactive bar chart
plot1 <- plot_ly(data = nationality_data, x = ~nationality, y = ~count, type = "bar",
marker = list(color = custom_colors[nationality_data$nationality])) %>%
layout(title = "Player Nationalities prediction based on their names",
xaxis = list(title = "Nationality", categoryorder = "total descending"), # Set categoryorder to total descending
yaxis = list(title = "Count"))
# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot1, "player_nationalities_by_their_name.html")
plot1Key observations:
The above analysis shows that Spanish names are the most prevalent, followed by Scandinavian, Dutch, Arabic, and German. Names suggesting Icelandic, Italian, Serbian, and French origins are less common.
Distribution of player positions using a bar chart:
Next let us analyze the distribution of player positions in the dataset. First group players by their positions, count the number of distinct players for each position, and arrange them in descending order. Then, create an interactive bar chart visualizing the count of players for each position, helping to understand the distribution of players across different positions in the dataset.
# Distribution of player positions using a bar chart
position_distribution <- combined_data %>%
group_by(position_name) %>%
summarise(distinct_player_count = n_distinct(player_id.x)) %>%
arrange(desc(distinct_player_count)) %>%
mutate(position_name = factor(position_name, levels = position_name))
# Plotting position distribution
plot2 <- position_distribution %>%
plot_ly(x = ~position_name, y = ~distinct_player_count, type = "bar") %>%
layout(title = "Count of Players by Position",
xaxis = list(title = "Position"),
yaxis = list(title = "Distinct Player Count"))
# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot2, "player_position_distribution.html")
plot2Key observations:
The above bar chart reveals that midfielders constitute the largest group, underscoring their pivotal role in gameplay. Defenders and forwards also represent significant proportions, reflecting the essential balance between offensive and defensive strategies in team composition. The chart effectively highlights the diversity of roles within a football team, with midfielders leading in numbers. This visualization aids in understanding the commonality of player positions and their strategic importance in the sport.
Goal Distribution by Minute using a line chart:
Now let analyze the distribution of goals scored over different regulation minutes in matches. Let us group the data by match ID, regulation minute, and goal ID to count distinct goals scored in each minute. Then, sum up the goals count for each minute and generates a line chart to visualize the goal distribution.
# Goal Distribution by Minute using a line chart
# Grouping data by match_id, minute_label, and goal_id to count distinct goals
goal_distribution <- combined_data %>%
filter(!is.na(minute_regulation)) %>%
group_by(match_id, minute_regulation, goal_id) %>%
summarise(goals_count = n_distinct(goal_id))
# Grouping data by minute_label to sum up the goals count for each minute
goal_distribution <- goal_distribution %>%
group_by(minute_regulation) %>%
summarise(goals_count = sum(goals_count))
# Create hover text
hover_text <- paste("Minute = ", goal_distribution$minute_regulation, "<br>Number of Goals = ", goal_distribution$goals_count)
# Plotting goal distribution by minute
plot3 <- plot_ly(x = ~goal_distribution$minute_regulation, y = ~goal_distribution$goals_count,
type = "scatter", mode = "lines+markers",
marker = list(color = 'rgba(65, 105, 225, .6)'),
line = list(shape = "spline"),
text = hover_text) %>%
layout(title = "Goal Distribution by Minute",
xaxis = list(title = "Minute"),
yaxis = list(title = "Goals Count"))
# Save the plot as a HTML widget
htmlwidgets::saveWidget(plot3, "goal_distribution.html")
plot3# Calculate total sum of goals scored over all minutes
total_goals <- sum(goal_distribution$goals_count)
# Print total sum of goals
print(paste("Total Goals Scored:", total_goals))[1] "Total Goals Scored: 2548"
Key observations:
The above plot indicates a fluctuating pattern of goal scoring, with a notable peak around the 90th minute, suggesting a surge in goals towards the match’s end. This trend could reflect the intensity of play during the final moments when teams often press for a decisive goal. The visualization captures the dynamic nature of goal scoring and can provide strategic insights into periods of high scoring probability within a game.
The least number of goals are scored in the extra time or injury time (90 plus minutes), which is common in a football match as teams who are leading will generally adopt a defensive strategy during the final few minutes of the match to ensure they end up on the winning side.
Text Analysis and Word Cloud Generation using Match Data:
#Word Cloud Generation using Match Data
install.packages("tm")package 'tm' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Manoj\AppData\Local\Temp\Rtmp80ypoh\downloaded_packages
install.packages("wordcloud")package 'wordcloud' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Manoj\AppData\Local\Temp\Rtmp80ypoh\downloaded_packages
install.packages("RColorBrewer")package 'RColorBrewer' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Manoj\AppData\Local\Temp\Rtmp80ypoh\downloaded_packages
library(tm)
library(wordcloud)
library(RColorBrewer)This code segment performs text analysis on various columns of data, combining them into a single corpus. It then processes the text by removing numbers, punctuation, whitespace, converting to lowercase, and removing English stopwords. Afterward, it creates a document-term matrix and converts it into a data frame. Word frequencies are computed and sorted, and finally, a word cloud is generated based on the sorted word frequencies, showcasing the most common words in the combined text data.
# Combine text from multiple columns into a single corpus
text_corpus <- paste(combined_data$match_name.x, combined_data$stadium_name.x, combined_data$city_name.x, combined_data$country_name.x, combined_data$team_name.x, combined_data$team_name.y, combined_data$player_team_name)
# Create a corpus
docs <- Corpus(VectorSource(text_corpus))
# Clean the text
docs <- docs %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english"))
# Create a document-term matrix
dtm <- DocumentTermMatrix(docs)
# Convert the matrix to a data frame
dtm_df <- as.data.frame(as.matrix(dtm))
# Compute word frequencies
word_freq <- colSums(dtm_df)
# Sort words by frequency
sorted_word_freq <- sort(word_freq, decreasing = TRUE)
# Create a word cloud
wordcloud(words = names(sorted_word_freq), freq = sorted_word_freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))Key observations of what the word cloud indicates:
Countries and Teams: Words like Germany, Brazil, Argentina, and France are prominent, suggesting these are popular football nations often mentioned in the data.
Match Venues: The presence of words such as “stadium” and specific stadium names points to the various locations where significant football matches have been held.
Frequency Indication: The size of each word correlates with its frequency in the dataset, with larger words appearing more frequently, indicating their prominence in the context of international football matches.
This visualization effectively communicates the key aspects of the dataset, emphasizing the most common terms and providing insights into the focus areas within the data related to football matches. The word cloud serves as a visual summary, showcasing the relative importance of different countries, cities, and stadiums in the dataset.
Word Cloud of Player Nationalities in Combined Data:
The code extracts player nationalities from various columns in the combined data and creates a word cloud visualization to display the frequencies of these nationalities. The word cloud’s size represents each nationality’s frequency.
# Extract player nationalities
player_nationalities <- c(combined_data$player_team_name, combined_data$team_name.x, combined_data$team_name.y)
# Create a word cloud of player nationalities
wordcloud(player_nationalities,
min.freq = 5,
scale = c(5, 0.5),
colors = brewer.pal(8, "Dark2"),
random.order = TRUE,
main = "Player Nationalities Word Cloud")The word cloud visualization represents the diversity of player nationalities within the dataset. The size of each word is indicative of the frequency of that nationality’s occurrence, providing a quick visual representation of the most common nationalities among players.
Notably, countries such as Argentina, Germany, Italy, Belgium, and Neterlands are more prominent, suggesting a higher number of players hail from these nations. Conversely, smaller words denote nationalities that are less frequent within the data.
This visualization serves as a useful tool for discerning patterns at a glance, such as the dominance of European players if European countries are represented by larger words in the cloud.